Abstract. In this paper we propose an algorithm for polynomial-time
reinforcement learning in factored Markov decision processes (FMDPs). The
factored optimistic initial model (FOIM) algorithm maintains an empirical
model of the FMDP in a conventional way, and always follows a greedy policy
with respect to its model. The only trick of the algorithm is that the model is
initialized optimistically. We prove that with suitable initialization (i) FOIM
converges to the fixed point of approximate value iteration (AVI); (ii) the
number of steps when the agent makes non-near-optimal decisions (with respect to
the solution of AVI) is polynomial in all relevant quantities; (iii) the per-step
costs of the algorithm are also polynomial. To the best of our knowledge, FOIM
is the first algorithm with these properties. This extended version contains
the rigorous proofs of the main theorem. A version of this paper appeared in
ICML'09.



1. Introduction 

Factored Markov decision processes (FMDPs) are practical ways to compactly 
formulate sequential decision problems — provided that we have ways to solve them. 
When the environment is unknown, all effective reinforcement learning methods ap- 
ply some form of the "optimism in the face of uncertainty" principle: whenever the 
learning agent faces the unknown, it should assume high rewards in order to en- 
courage exploration. Factored optimistic initial model (FOIM) takes this principle 
to the extreme: its model is initialized to be overly optimistic. For more often vis- 
ited areas of the state space, the model gradually gets more realistic, inspiring the 
agent to head for unknown regions and explore them, in search of some imaginary
"Garden of Eden". The working of the algorithm is simple to the extreme: it will
not make any explicit effort to balance exploration and exploitation, but always 
follows the greedy optimal policy with respect to its model. We show in this paper 
that this simple (even simplistic) trick is sufficient for effective FMDP learning. 

The algorithm is an extension of OIM (optimistic initial model) [?], which is a 
sample-efficient learning algorithm for flat MDPs. There is an important difference, 
however, in the way the model is solved. Every time the model is updated, the 
corresponding value function needs to be re-calculated (or updated). For flat MDPs,
this is not a problem: various dynamic programming-based algorithms (like value 
iteration) can solve the model to any required accuracy in polynomial time. 

The situation is less bright for generating near-optimal FMDP solutions: all 
currently known algorithms may take exponential time, e.g. the approximate pol- 
icy iteration of [?] using decision-tree representations of policies, or solving the 
exponential-size flattened version of the FMDP. If we require polynomial running 
time (as we do in this paper in search of a practical algorithm), then we have
to accept sub-optimal solutions. The only known example of a polynomial-time 
FMDP planner is factored value iteration (FVI) [?], which will serve as the base 
planner for our learning method. This planner is guaranteed to converge, and the 
error of its solution is bounded by a term depending only on the quality of function 
approximators. 

Our analysis of the algorithm will follow the established techniques for analyzing 
sample-efficient reinforcement learning (like the works of [?, ?, ?, ?, ?] on flat MDPs 
and [?] on FMDPs). However, the listed proofs of convergence rely critically on 
access to a near-optimal planner, so they have to be generalized suitably. By 
doing so, we are able to show that FOIM converges to a bounded-error solution in 
polynomial time with high probability. 

We introduce basic concepts and notations in section 2, then in section 3 we review
existing work, with special emphasis on the immediate ancestors of our method.
In sections 4 and 5 we describe the building blocks of FOIM and the FOIM algorithm,
respectively. We finish the paper with a short analysis and discussion.

2. Basic concepts and notations 

An MDP is characterized by a quintuple $(X, A, R, P, \gamma)$, where $X$ is a finite set
of states; $A$ is a finite set of possible actions; $R : X \times A \to \mathbb{R}$ is the reward function
of the agent; $P : X \times A \times X \to [0, 1]$ is the transition function; and finally, $\gamma \in [0, 1)$
is the discount rate on future rewards. A (stationary, Markov) policy of the agent
is a mapping $\pi : X \times A \to [0, 1]$. The optimal value function $V^* : X \to \mathbb{R}$ gives
the maximum attainable total rewards for each state, and satisfies the Bellman
equation

(1) $V^*(x) = \max_a \sum_y P(y \mid x, a) \bigl( R(x, a) + \gamma V^*(y) \bigr)$.

Given the optimal value function, it is easy to get an optimal policy: $\pi^*(x, a) := 1$
iff $a = \arg\max_{a'} \sum_y P(y \mid x, a') \bigl( R(x, a') + \gamma V^*(y) \bigr)$, and $0$ otherwise.

2.1. Vector notation. Let $N := |X|$, and suppose that states are integers from
1 to $N$, i.e. $X = \{1, 2, \ldots, N\}$. Clearly, value functions are equivalent to $N$-dimensional
vectors of reals, which may be indexed with states. The vector corresponding
to $V$ will be denoted as $v$ and the value of state $x$ by $v_x$. Similarly, for
each $a$ let us define the $N$-dimensional column vector $r^a$ with entries $r^a_x = R(x, a)$
and the $N \times N$ matrix $P^a$ with entries $P^a_{x,y} = P(y \mid x, a)$.

The Bellman equations can be expressed in vector notation as $v^* = \max_{a \in A} (r^a + \gamma P^a v^*)$,
where max denotes the componentwise maximum operator. The Bellman
equations are the basis of many RL algorithms, most notably, value iteration:

(2) $v_{t+1} := \max_{a \in A} (r^a + \gamma P^a v_t)$,

which converges to $v^*$ for any initial vector $v_0$.
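
As an illustration of iteration (2), the following is a minimal pure-Python sketch on a flat MDP. The function name, the nested-list data layout, the stopping tolerance and the two-state toy MDP are assumptions made for this example only; they are not part of the paper.

```python
def value_iteration(rewards, transitions, gamma, tol=1e-10, max_iters=100_000):
    """Iterate v_{t+1} = max_a (r^a + gamma * P^a v_t) until the update is below tol.

    rewards[a][x]        holds R(x, a)
    transitions[a][x][y] holds P(y | x, a)
    """
    n_states = len(rewards[0])
    v = [0.0] * n_states  # any initial vector v_0 works
    for _ in range(max_iters):
        v_new = [
            max(
                rewards[a][x]
                + gamma * sum(p * v[y] for y, p in enumerate(transitions[a][x]))
                for a in range(len(rewards))
            )
            for x in range(n_states)
        ]
        if max(abs(u - w) for u, w in zip(v_new, v)) < tol:
            return v_new
        v = v_new
    return v


# Hypothetical toy MDP: action 0 stays put, action 1 moves to the other state.
# Only staying in state 1 pays a reward of 1.
toy_rewards = [[0.0, 1.0], [0.0, 0.0]]
toy_transitions = [
    [[1.0, 0.0], [0.0, 1.0]],  # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],  # action 1: switch
]
v_star = value_iteration(toy_rewards, toy_transitions, gamma=0.9)
```

In this toy problem the optimal behaviour is to reach state 1 and stay there, so the values approach $1/(1-\gamma) = 10$ for state 1 and $\gamma \cdot 10 = 9$ for state 0.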

2.2. Factored structure. We assume that $X$ is the Cartesian product of $m$ smaller
state spaces (corresponding to individual variables):

$X = X_1 \times X_2 \times \ldots \times X_m$.

For the sake of notational convenience we will assume that each $X_i$ has the same
size, $|X_1| = |X_2| = \ldots = |X_m| = n$. With this notation, the size of the full state
space is $N = |X| = n^m$. We note that all derivations and proofs carry through to
variable spaces of different sizes.

Definition 2.1. For any subset of variable indices $Z \subseteq \{1, 2, \ldots, m\}$, let
$X[Z] := \times_{i \in Z} X_i$; furthermore, for any $x \in X$, let $x[Z]$ denote the value of the variables with
indices in $Z$. We shall also use the notation $x[Z]$ without specifying a full vector of
values $x$; in such cases $x[Z]$ denotes an element in $X[Z]$. For single-element sets
$Z = \{i\}$ we shall also use the shorthand $x[\{i\}] = x[i]$.

Definition 2.2 (Local-scope function). A function f is a local-scope function if it 
is defined over a subspace X[Z] of the state space, where Z is a (presumably small) 
index set. 

If $|Z|$ is small, local-scope functions can be represented efficiently, as they can
take only $n^{|Z|}$ different values.

Definition 2.3 (Extension). Let $f : X[Z] \to \mathbb{R}$ be a local-scope function. Its
extension to the whole state space is defined by $f(x) := f(x[Z])$. The extension
operator for $Z$ is a linear operator with a matrix $E_{[Z]} \in \mathbb{R}^{|X| \times |X[Z]|}$, with entries

$(E_{[Z]})_{u, v[Z]} = \begin{cases} 1, & \text{if } u[Z] = v[Z]; \\ 0, & \text{otherwise.} \end{cases}$

For any local-scope function $f$ with a corresponding vector representation $f \in \mathbb{R}^{|X[Z]| \times 1}$,
$E_{[Z]} f \in \mathbb{R}^{|X| \times 1}$ is the vector representation of the extended function.
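
The extension of Definition 2.3 can be sketched in a few lines of Python. This is only an illustration: it tabulates the extended function directly instead of building the matrix $E_{[Z]}$, and the function name, the 0-based variable indices and the three-variable example are assumptions introduced here.

```python
from itertools import product


def extend_local_function(local_values, scope, domain_sizes):
    """Return the extension f(x) := f(x[Z]) as a dict keyed by full states x.

    local_values maps tuples x[Z] to reals, scope is the index tuple Z,
    and domain_sizes[i] is |X_i|.
    """
    extended = {}
    for x in product(*(range(n) for n in domain_sizes)):
        x_restricted = tuple(x[i] for i in scope)  # this is x[Z]
        extended[x] = local_values[x_restricted]
    return extended


# Hypothetical example: m = 3 binary variables, scope Z = {0, 2} (0-based).
local_f = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
full_f = extend_local_function(local_f, scope=(0, 2), domain_sizes=(2, 2, 2))
```

The local table has $n^{|Z|} = 4$ entries while the extended one has $n^m = 8$; the extended value ignores the variable outside the scope, which is what makes the compact representation possible.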

We assume that the reward function is the sum of $J$ local-scope functions with
scopes $Z_j$: $R(x, a) = \sum_{j=1}^J R_j(x[Z_j], a)$. In vector notation: $r^a = \sum_{j=1}^J E_{[Z_j]} r_j^a$.

We also assume that for each variable $i$ there exist neighborhood sets $\Gamma_i$ such that
the value of $x_{t+1}[i]$ depends only on $x_t[\Gamma_i]$ and the action $a_t$ taken. Then we can
write the transition probabilities in a factored form

(3) $P(y \mid x, a) = \prod_{i=1}^m P_i(y[i] \mid x[\Gamma_i], a)$

for each $x, y \in X$, $a \in A$, where each factor is a local-scope function
$P_i : X[\Gamma_i] \times A \times X_i \to [0, 1]$ (for all $i \in \{1, \ldots, m\}$). In vector/matrix notation, for any vector
$v \in \mathbb{R}^{|X| \times 1}$, $P^a v = \bigotimes_{i=1}^m (P_i^a v[\Gamma_i])$, where $\otimes$ denotes the Kronecker product.
Finally, we assume that the sizes of all local scopes are bounded by a small constant
$m_f \ll m$: $|\Gamma_i| \le m_f$ for all $i$. As a consequence, all probability factors can be
represented with tables having at most $N_f := n^{m_f}$ rows.
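
The factored form (3) can be evaluated directly from the local tables. The sketch below assumes a dictionary-based table layout and a two-variable example with made-up probabilities; these choices are illustrative and do not come from the paper.

```python
from itertools import product
from math import prod


def factored_transition_prob(y, x, a, factors, parents):
    """Evaluate P(y | x, a) = prod_i P_i(y[i] | x[Gamma_i], a).

    parents[i] is the index tuple Gamma_i; factors[i][(x[Gamma_i], a)] is a
    list of probabilities over the values of y[i].
    """
    return prod(
        factors[i][(tuple(x[j] for j in parents[i]), a)][y[i]]
        for i in range(len(parents))
    )


# Hypothetical example: two binary variables and a single action 0.
# Variable 0 depends on itself, variable 1 depends on both variables.
example_parents = [(0,), (0, 1)]
example_factors = [
    {((0,), 0): [0.9, 0.1], ((1,), 0): [0.2, 0.8]},
    {
        ((0, 0), 0): [1.0, 0.0],
        ((0, 1), 0): [0.5, 0.5],
        ((1, 0), 0): [0.3, 0.7],
        ((1, 1), 0): [0.0, 1.0],
    },
]
p_example = factored_transition_prob(
    (1, 1), (1, 0), 0, example_factors, example_parents
)
# Summing over all next states y must give 1 for any fixed (x, a).
total_mass = sum(
    factored_transition_prob(y, (1, 0), 0, example_factors, example_parents)
    for y in product(range(2), repeat=2)
)
```

Here `p_example` is the product $0.8 \cdot 0.7$ of the two local entries. Each table is indexed by at most $n^{m_f}$ parent configurations, independently of the size $n^m$ of the full state space.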

An FMDP is fully characterized by the tuple $M = (\{X_i\}_{i=1}^m; A; \{Z_j\}_{j=1}^J; \{R_j\}_{j=1}^J; \{\Gamma_i\}_{i=1}^m; \{P_i\}_{i=1}^m; x_s; \gamma)$.



3. Related literature 

The idea of representing a large MDP using a factored model was first proposed 
by [?] but similar ideas appear already in the works of [?, ?]. 
3.1. Planning in known FMDPs. Decision trees (or equivalently, decision lists)
provide a way to represent the agent's policy compactly. [?] and [?, ?] present 
algorithms to evaluate and improve such policies, according to the policy iteration 
scheme. Unfortunately, the size of the policies may grow exponentially even with a 
decision tree representation [?, ?]. 

The exact Bellman equations (1) can be transformed to an equivalent linear
program with $N$ variables and $N \cdot |A|$ constraints. In the approximate linear programming
approach, we approximate the value function as a linear combination of
$K$ basis functions, resulting in an approximate LP with $K$ variables and $N \cdot |A|$
constraints. Both the objective function and the constraints can be written in com- 
pact forms, exploiting the local-scope property of the appearing functions. [?] show 
that the maximum of exponentially many local-scope functions can be computed 
by rephrasing the task as a non-serial dynamic programming task and eliminating 
variables one by one. Therefore, the equations can be transformed to an equivalent, 
more compact linear program. The gain may be exponential, but this is not nec- 
essarily so in all cases. Furthermore, solutions will not be (near-)optimal because 
of the function approximation; the best that can be proved is bounded error from 
the optimum (where the bound depends on the quality of basis functions used for 
approximation).

The approximate policy iteration algorithm [?, ?] also uses an approximate LP 
reformulation, but it is based on the policy-evaluation Bellman equations. Policy- 
evaluation equations are, however, linear and do not contain the maximum operator, 
so there is no need for a costly transformation step. On the other hand, the algo- 
rithm needs an explicit decision tree representation of the policy. [?] has shown that 
the size of the decision tree representation can grow exponentially. Furthermore, 
the convergence properties of these algorithms are unknown. 

Factored value iteration [?] also approximates the value function as a linear com- 
bination of basis functions, but uses a variant of approximate value iteration: the 
projection operator is modified to avoid divergence. FVI converges in a polynomial 
number of steps, but the solution may be sub-optimal. The error of the solution 
has bounded distance from the optimal value function, where the bound depends 
on the quality of function approximation. As an integral part of FOIM, FVI is
described in detail in Section 4.1.



3.2. Reinforcement Learning in FMDPs. In the reinforcement learning set- 
ting, the agent interacts with an FMDP environment with unknown parameters. 
In the model-based approach, the agent has to learn the structure of the FMDP
(i.e., the dependency sets $\Gamma_i$ and the reward domains $Z_j$), the transition probability
factors $P_i$ and the reward factors $R_j$.

Unknown transitions. Most approaches assume that the structure of the 
FMDP and the reward functions are known, so only transition probabilities need to 
be learnt. Examples include the factored versions of sample-efficient model-based 
RL algorithms: factored $E^3$ [?], factored R-max [?], or factored MBIE [?]. All
the abovementioned algorithms have polynomial sample complexity (in all relevant
task parameters), and require polynomially many calls to an FMDP planner. Note,
however, that all of the mentioned approaches require access to a planner that
is able to produce $\epsilon$-optimal solutions,^1 and to date, no algorithm exists that
would accomplish this accuracy in polynomial time. [?] also present an algorithm
where exploration is guided by the uncertainties of the linear programming solution. 
While this approach does not require access to a near-optimal planner, no formal 
performance bounds are known. 

Unknown rewards. Typically, it is asserted that the rewards can be approx- 
imated from observations analogously to transition probabilities. However, if the 
reward is composed of multiple factors (i.e., J > 1), then we can only observe the 
sums of unknown quantities, not the individual quantities themselves. To date, we 
know of no efficient approximation method for learning factored rewards. 

Unknown structure. Few attempts exist that try to obtain the structure of 
the FMDP automatically. [?] present a method that learns the structure of an 
FMDP in polynomial time (in all relevant parameters). 

4. Building blocks of FOIM 

We describe the two main building blocks of our algorithm, factored value iter- 
ation and optimistic initial model. 

4.1. Factored value iteration. We assume that all value functions are approximated
as the linear combination of $K$ basis functions $h_k : X \to \mathbb{R}$: $V(x) = \sum_{k=1}^K w_k h_k(x)$.

Let $H$ be the $N \times K$ matrix mapping feature weights to state values, with
entries $H_{x,k} = h_k(x)$, and let $G$ be an arbitrary $K \times N$ linear mapping projecting
state values to feature weights. Let $w \in \mathbb{R}^K$ denote the weight vector of the
basis functions. It is known that if $\|HG\|_\infty \le 1$, then the approximate Bellman
equations $w^\times = G \max_{a \in A} (r^a + \gamma P^a H w^\times)$ have a unique fixed point solution $w^\times$,
and approximate value iteration (AVI)

(4) $w_{t+1} := G \max_{a \in A} (r^a + \gamma P^a H w_t)$

converges there for any starting vector $w_0$.
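
To make iteration (4) concrete, here is a small pure-Python sketch on a flat four-state MDP. It assumes a state-aggregation choice of $H$ and $G$ (indicator features and within-cluster averaging), which is only one simple way to obtain nonnegative matrices with $\|HG\|_\infty = 1$; it is not the projection used in the paper. The MDP and all names are hypothetical.

```python
def approximate_value_iteration(rewards, transitions, gamma, H, G, n_iters=500):
    """Iterate w_{t+1} = G max_a (r^a + gamma * P^a H w_t), as in eq. (4)."""
    n_states, n_features = len(H), len(H[0])
    w = [0.0] * n_features  # any starting vector w_0 works
    for _ in range(n_iters):
        # v = H w: map feature weights to state values.
        v = [sum(H[x][k] * w[k] for k in range(n_features)) for x in range(n_states)]
        # Componentwise Bellman backup max_a (r^a + gamma * P^a v).
        backup = [
            max(
                rewards[a][x]
                + gamma * sum(p * v[y] for y, p in enumerate(transitions[a][x]))
                for a in range(len(rewards))
            )
            for x in range(n_states)
        ]
        # w = G backup: project state values back to feature weights.
        w = [sum(G[k][x] * backup[x] for x in range(n_states)) for k in range(n_features)]
    return w


# Assumed aggregation: states {0, 1} share feature 0, states {2, 3} share feature 1.
H_agg = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
G_agg = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]

# Hypothetical MDP: action 0 jumps to state 3, action 1 jumps to state 0.
avi_rewards = [[0.0, 4.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
avi_transitions = [
    [[0.0, 0.0, 0.0, 1.0]] * 4,
    [[1.0, 0.0, 0.0, 0.0]] * 4,
]
w_fixed = approximate_value_iteration(avi_rewards, avi_transitions, 0.5, H_agg, G_agg)
```

For this example the fixed point can be checked by hand: $w^\times = (10/3, 2)$ satisfies the approximate Bellman equations, and the iteration contracts towards it at rate $\gamma = 0.5$.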

Definition 4.1. Let the AVI-optimal value function be defined as $v^\times = H w^\times$.

As shown by [?], the distance of the AVI-optimal value function from the true optimum
is bounded by the projection error of $v^*$:

(5) $\|v^\times - v^*\|_\infty \le \frac{1}{1 - \gamma} \|HGv^* - v^*\|_\infty$.

We make the further assumption that all the basis functions are local-scope ones:
for each $k \in \{1, \ldots, K\}$, $h_k : X[C_k] \to \mathbb{R}$, with feature matrices $H_k \in \mathbb{R}^{|X[C_k]| \times K}$.
The feature matrix $H$ can be decomposed as $H = \sum_{k=1}^K E_{[C_k]} H_k$.

Definition 4.2. For any matrices H and G, let the row-normalization of G be a 
matrix N(G) of the same size as G, and having the entries [A/"(G)]fc, x = \\[hg] '^ \\ — • 

Throughout the paper, we shall use the projection matrix $G = \mathcal{N}(H^T)$.
The AVI equation (4) can be considered as the product of the $K \times N$ matrix $G$
and an $N \times 1$ vector $v_t = \max_{a \in A} (r^a + \gamma P^a H w_t)$. Using the above assumptions



1 The assumption of [?] is slightly less restrictive: they only require that the
returned policy has value at least $\rho V^*$ with some $\rho < 1$. However, no planner is known that can
achieve this and cannot achieve near-optimality.

and notations, we can see that for any $x \in X$, the corresponding column of $G$ and
the corresponding element of $v_t$ can be computed in polynomial time:

1 K 
[GKx = \\\HHT] kAoo E M *Mc k ,] ; 

J if 
[v t ] x = m^x[^[r«] x[ ^ ]+7 ^i? [rucJ ((g)^)(h feWfe , t ) 

i=i fe=i iec k 

Factored value iteration draws $N_1 \ll N$ states uniformly at random, and performs
approximate value iteration on this reduced state set.

Theorem 4.3 ([?]). Suppose that $G = \mathcal{N}(H^T)$. For any $\epsilon > 0$, $\delta > 0$, if the

sample size is N\ = 0(^- log ?f), then with probability at least 1 — 8, factored value 
iteration converges to a weight vector $w$ such that $\|w - w^\times\|_\infty \le \epsilon$. In terms of
the optimal value function,

(6) $\|v^\times - v^*\|_\infty \le \frac{1}{1 - \gamma} \|HGv^* - v^*\|_\infty + \epsilon$.

4.2. Optimistic initial model for flat MDPs. There are a number of sample-
efficient learning algorithms for MDPs, e.g., $E^3$, R-max, MBIE, and most recently,
OIM. The underlying principle of all these methods is similar: they all maintain 
an approximate MDP model of the environment. Wherever the uncertainty of 
the model parameters is high, the models are optimistic. This way, the agent is 
encouraged to explore the unknown areas, reducing the uncertainty of the models. 

Here, we shall use and extend OIM to factored environments. In the OIM algorithm,
we introduce a hypothetical "garden of Eden" (GOE) state $x_E$, where
the agent gets a very large reward $R_E$ and remains there indefinitely. The model
is initialized with fake experience, according to which the agent has experienced
an $(x, a, x_E)$ transition for all $x \in X$ and $a \in A$. According to this initial model,
each state has value $R_E / (1 - \gamma)$, which is a major overestimation of the true values.
The model is continuously updated by the collected experience of the agent,
who always takes the greedy optimal action with respect to its current model. For
well-explored $(x, a)$ pairs, the optimism of the model vanishes, thus encouraging
the agent to explore the less-known areas.

The reason for choosing OIM is twofold: (1) The optimism of the model is 
ensured at initialization time, and after that, no extra work is needed to ensure the 
optimism of the model or to encourage exploration. (2) Results on several standard 
benchmark MDPs indicate that OIM is superior to the other algorithms mentioned. 

5. Learning in FMDPs with an Optimistic initial model 

Similarly to other approaches, we will make the assumptions that (a) the de- 
pendencies are known, and (b) the reward function is known, only the transition 
probabilities need to be learned. 

5.1. Optimistic initial model for factored MDPs. During the learning process,
we will maintain approximations of the model, in particular, of the transition
probability factors. We extend all state factors with the hypothetical "garden of
Eden" state $x_E$. Seeing the current state $x$ and the action $a$ taken, the transition
model should give the probabilities of various next states $y$. Specifically, the $i$th
factor of the transition model should give the probabilities of various $y_i$ values,
given $x[\Gamma_i]$ and $a$. Initially, the agent has no idea, so we let it start with an overly
optimistic model: we inject the fake experience to the model that taking action $a$
in $x[\Gamma_i]$ leads to a state with $i$th component $y_i = x_E$. This optimistic model will
encourage the agent to explore action $a$ whenever its state is consistent with $x[\Gamma_i]$.
After many visits to $(x[\Gamma_i], a)$, the weight of the initial fake experience will shrink,
and the optimistic belief of the agent (together with its exploration-boosting effect)
fades away. However, by that time, the collected experience provides an accurate
approximation of the $P_i(y_i \mid x[\Gamma_i], a)$ values.

So, according to the initial model (based purely on fake experience),

$P(y \mid x, a) = \begin{cases} 1, & \text{if } y = (x_E, \ldots, x_E); \\ 0, & \text{otherwise,} \end{cases}$

and $R(x, a) = c \cdot R_E$ if $c$ components of $x$ are $x_E$. This model is optimistic indeed:
all non-GOE states have value at least $\gamma R_E / (1 - \gamma)$. Note that it is not possible
to encode the $R_E$-rewards for the GOE states using the original set of reward
factors, so for each state factor $i$, we add a new reward factor with local scope $X_i$:
$R'_i : X_i \times A \to \mathbb{R}$, defining

$R'_i(x, a) = \begin{cases} R_E, & \text{if } x = x_E; \\ 0, & \text{otherwise.} \end{cases}$

With this modification, we are able to fully specify our algorithm, as shown in the pseudocode below.



Algorithm 1 Factored optimistic initial model.
input:
    $M = (\{X_i\}_1^m; A; \{Z_j\}_1^J; \{R_j\}_1^J; \{\Gamma_i\}_1^m; \{P_i\}_1^m; x_s; \gamma)$
    $\{H_k\}_1^K$; $\{C_k\}_1^K$; $\epsilon > 0$; $\delta > 0$; $R_E$
initialization:
    $t := 0$;
    for all $i$, add GOE states: $X_i := X_i \cup \{x_E\}$
    for all $i$, add GOE reward function $R'_i$
    for all $i$, $a$, $x[\Gamma_i]$, $y \in X_i \setminus \{x_E\}$, let
        TransitionCount_i($x[\Gamma_i]$, $a$, $y$) := 0;
        TransitionCount_i($x[\Gamma_i]$, $a$, $x_E$) := 1;
        VisitCount_i($x[\Gamma_i]$, $a$) := 1;
repeat
    $[\hat{P}_i^a]_{x[\Gamma_i], y}$ := TransitionCount_i($x[\Gamma_i]$, $a$, $y$) / VisitCount_i($x[\Gamma_i]$, $a$)
    $w_t$ := FactoredValueIteration($M$, $\{\hat{P}_i^a\}$, $\epsilon$, $\delta$)
    update TransitionCount and VisitCount corresponding to transition
        $(x_t, a_t, x_{t+1})$.
    $t := t + 1$
until the interaction ends
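
The count bookkeeping of Algorithm 1 for a single transition factor can be sketched as follows. This is an illustrative sketch only: the class name, the string label used for the garden-of-Eden value, and the lazy creation of table entries (instead of the eager initialization over all $x[\Gamma_i]$ and $a$ in the pseudocode) are assumptions made here. Planning and action selection are not shown.

```python
GOE = "x_E"  # assumed label for the hypothetical garden-of-Eden value


class FactorCounts:
    """Transition and visit counts of one factor i, keyed by (x[Gamma_i], a)."""

    def __init__(self):
        self.transition = {}
        self.visits = {}

    def _ensure(self, key):
        # Fake initial experience: one visit that led to x_E.
        if key not in self.visits:
            self.transition[key] = {GOE: 1}
            self.visits[key] = 1

    def update(self, x_parents, a, y_i):
        """Record an observed transition component (x[Gamma_i], a, y_i)."""
        key = (x_parents, a)
        self._ensure(key)
        self.transition[key][y_i] = self.transition[key].get(y_i, 0) + 1
        self.visits[key] += 1

    def prob(self, x_parents, a, y_i):
        """Model estimate TransitionCount / VisitCount for this component."""
        key = (x_parents, a)
        self._ensure(key)
        return self.transition[key].get(y_i, 0) / self.visits[key]


counts = FactorCounts()
initial_goe_prob = counts.prob((0, 1), "a", GOE)
for _ in range(4):
    counts.update((0, 1), "a", 1)
goe_prob_after = counts.prob((0, 1), "a", GOE)
real_prob_after = counts.prob((0, 1), "a", 1)
```

Before any real visit the component puts all probability on $x_E$; after $k$ real visits the fake transition keeps weight $1/(k+1)$, so the optimism fades as the component is explored.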



5.2. Analysis. Below we prove that FOIM gets as good as possible. What is 
"as good as possible"? We clearly cannot expect better policies than the one the 
planner would output, were the parameters of the FMDP known. And because 
of the polynomial-running-time constraint on the planner, it will not be able to 
compute a near-optimal solution. However, we can prove that FOIM gets $\epsilon$-close
to the solution of the planner (which is AVI-near-optimal if the planner is FVI),
except for a polynomial number of mistakes during its run.^2

Theorem 5.1. Suppose that an agent is following FOIM in an unknown FMDP,
where all reward components fall into the interval $[0, R_{max}]$, there are $m$ state factors,
and all probability- and reward-factors depend on at most $m_f$ factors. Let
$N_f = n^{m_f}$ and let $\epsilon > 0$ and $\delta > 0$. If the initial values of FOIM satisfy



R E =c- 



(l- 7 )i 



8 (l-7)e<5 



then the number of timesteps when FOIM makes non-AVI-near-optimal moves, i.e.,
when $Q^{FOIM}(x_t, a_t) < Q^\times(x_t, a_t) - \epsilon$, is bounded by

0( R \^-T l V 1 log 2 ^) 
with probability at least $1 - \delta$.

Proof sketch. The proof uses standard techniques from the literature of sample- 
efficient reinforcement learning. Most notably, our proof follows the structure of [?].
There are two important differences compared to previous approaches: (1) we may 
not assume that the planner is able to output a near-optimal solution, and (2) FOIM 
may make an unbounded number of model updates, so we cannot make use of the 
standard argument that "we are encountering only finitely many different models, 
each of them fails with negligible probability, so the whole algorithm fails with
negligible probability". Instead, a more careful analysis of the failure probability is
needed. The rigorous proof can be found in the appendix. 

5.2.1. Boundedness of value functions. According to our assumptions, all rewards
fall between $0$ and $R_{max}$. From this, it is easy to derive an upper bound on the
magnitude of the AVI-optimal value function v x . The bound we get is ||v x || < 
fE^Knax : = Vq. For future reference, we note that V — 9( (f"")2 )■ 

5.2.2. From visit counts to model accuracy. The FOIM algorithm builds a transition
probability model by keeping track of visit counts to state-action components
$(x[\Gamma_i], a)$ and state-action-state transition components $(x[\Gamma_i], a, y)$. First of all, we
show that if a state-action component is visited many times, then the corresponding
probability components $\hat{P}_{t,i}(y \mid x[\Gamma_i], a)$ become accurate.

Let us fix a timestep $t \in \mathbb{N}$, a probability factor $i \in \{1, \ldots, m\}$, a state-action
component $(x[\Gamma_i], a) \in X[\Gamma_i] \times A$, and $\epsilon_t > 0$. Let us denote the number of visits
to the component up to time $t$ by $k_t(x[\Gamma_i], a)$. Let us introduce the shorthands
$p_i = P_i(y \mid x[\Gamma_i], a)$ and $\hat{p}_{t,i} = \hat{P}_{t,i}(y \mid x[\Gamma_i], a)$. By Theorem 3 of [?] (an application
of the Hoeffding-Azuma inequality),

(7) $\Pr\Bigl( \sum_{y \in X_i} |p_i - \hat{p}_{t,i}| > \epsilon_t \Bigr) \le 2^n \exp\Bigl( -\frac{\epsilon_t^2 k_t(x[\Gamma_i], a)}{2} \Bigr)$.

Unfortunately, the above inequality only speaks about a single time step $t$, but we
need to estimate the failure probability for the whole run of the algorithm. By the
union bound, that is at most

(8) $\sum_{k=1}^{\infty} \Pr\Bigl( \sum_{y \in X_i} |p_i - \hat{p}_{t_k,i}| > \epsilon_{t_k} \Bigr)$.

2 We are using the term polynomial and polynomial in all relevant quantities as a shorthand
for polynomial in $m$, $N_f$, $|A|$, $R_{max}$, $1/(1 - \gamma)$, $1/\epsilon$ and $1/\delta$.
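
Inequality (7) can be inverted to see how fast the $L_1$ error of a factor shrinks with the visit count. The sketch below simply solves $2^n \exp(-\epsilon^2 k / 2) = \delta$ for $\epsilon$; the function names and the numbers in the example are assumptions chosen for illustration.

```python
import math


def failure_probability(k_visits, n_values, eps):
    """Right-hand side of (7): 2^n * exp(-eps^2 * k / 2)."""
    return 2.0 ** n_values * math.exp(-(eps ** 2) * k_visits / 2.0)


def l1_deviation_bound(k_visits, n_values, delta):
    """Accuracy eps for which the right-hand side of (7) equals delta."""
    beta = math.sqrt(2.0 * (math.log(1.0 / delta) + n_values * math.log(2.0)))
    return beta / math.sqrt(k_visits)


# Hypothetical numbers: a factor with n = 4 values and failure level 0.05.
eps_1000 = l1_deviation_bound(1000, 4, 0.05)
eps_4000 = l1_deviation_bound(4000, 4, 0.05)
```

Quadrupling the number of visits halves the guaranteed error, which is the usual $1/\sqrt{k}$ rate that the visit-count threshold in the analysis relies on.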

Let ko := O ( (i^yi^i log ™i-jlse ) ■ F° r k < ko, the number of visits is too low, 
so in eq. 0, either ei, or the right-hand side is too big. We choose the former: 
we make the failure probability less than some constant 8' by setting e tk = , 

where /3(8') = ^2{\og jr + nlog2). For k > k , the number of visits is sufficiently 

large, so we can decrease either the accuracy or the failure probability (or even 
both). It turns out that an approximation accuracy e tk — e(l — 7)/w is sufficient, 

so we decrease failure probability. Let us set 8' := O ( ^^ypq /log "^j^ J ■ 

With this choice of 8' and ko, (3(8') < e(l — "l)/m whenever k > ko, furthermore, 

2™exp ^— < 8', so we get that 

oo 

ko — 1 oo 

< £ 8' + £ 2«exp (-*£) 

fc — 1 A.: — fco 

^ , ( fc ° + 1 - cxp (-^) )^ e (^ m )- 

We can repeat this estimation for every state-action component $(x[\Gamma_i], a)$. There
are at most $m N_f |A|$ of these, so the total failure probability is still less than $\Theta(\delta)$.
This means that

9 E Pi 'hi < max ; V 7 =, -i ^ 

will hold for all $(x[\Gamma_i], a)$ pairs and all timesteps $t$ with high probability. From
now on, we will consider only realizations where the failure event does not happen,
but bear in mind that all our statements that are based on (9) are true only with
$1 - \Theta(\delta)$ probability.

From (O, we can easily get L\ bounds on the accuracy of the full transition prob- 

ability function: £ y£X |p(y|x, a) - P t (y|x, o)| < YZi ^( ^^^ ^) 
for all (x, a) € X x A and for all t. 



5.2.3. The known-state FMDP. A state-action component $(x[\Gamma_i], a)$ is called known
at timestep $t$ if it has been visited at least $k_0$ times, i.e., if $k_t(x[\Gamma_i], a) \ge k_0$. We
define the known-component FMDP $M^{K_t}$ as follows: (1) its state and action space,
rewards, and the decompositions of the transition probabilities (i.e., the dependency
sets $\Gamma_i$) are identical to the corresponding quantities of the true FMDP $M$, and
hence to the current approximate FMDP $\hat{M}_t$; (2) for all $a \in A$, $i \in \{1, \ldots, m\}$ and
$x[\Gamma_i] \in X[\Gamma_i]$, for any $y_i \in X_i$, the corresponding transition probability component
is

$P^{K_t}_i(y_i \mid x[\Gamma_i], a) := \begin{cases} P_i(y_i \mid x[\Gamma_i], a), & \text{if } (x[\Gamma_i], a) \in K_t; \\ \hat{P}_{t,i}(y_i \mid x[\Gamma_i], a), & \text{otherwise,} \end{cases}$

where $K_t$ denotes the set of components known at timestep $t$.

Note that FMDPs $M^{K_t}$ and $\hat{M}_t$ are very close to each other: unknown state-action
components have identical transition functions by definition, while for known
components, $\sum_{y \in X_i} |P^{K_t}_i(y \mid x[\Gamma_i], a) - \hat{P}_{t,i}(y \mid x[\Gamma_i], a)| \le \frac{\epsilon (1 - \gamma)}{m \gamma V_0}$. Consequently, for
all $(x, a)$,

(10) $\sum_{y \in X} |P^{K_t}(y \mid x, a) - \hat{P}_t(y \mid x, a)| \le \frac{\epsilon (1 - \gamma)}{\gamma V_0}$.

For an arbitrary policy $\pi$, let $v^\pi_{K_t}$ and $\hat{v}^\pi_t$ be the value functions (the fixed points
of the approximate Bellman equations) of $\pi$ in $M^{K_t}$ and $\hat{M}_t$, respectively. By
a suitable variant of the Simulation Lemma (see supplementary material) that
works with the approximate Bellman equations, we get that whenever (10) holds,
$\|v^\pi_{K_t} - \hat{v}^\pi_t\|_\infty \le \epsilon$.



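5.2.4. The FOIM model is optimistic. First of all, note that FOIM is not directly
using the empirical transition probabilities $\hat{P}_{t,i}$, but it is more optimistic: it gives
some chance for getting to the garden of Eden state $x_E$:

$\hat{P}^{FOIM}_{t,i}(y_i \mid x[\Gamma_i], a) = \begin{cases} \frac{1}{k_{t,i} + 1}, & \text{if } y_i = x_E; \\ \frac{k_{t,i}}{k_{t,i} + 1} \hat{P}_{t,i}(y_i \mid x[\Gamma_i], a), & \text{otherwise,} \end{cases}$

where we introduced the shorthand $k_{t,i} = k_t(x[\Gamma_i], a)$.
Now, we show that

(11) $Q^\times(x, a) - \Bigl[ R(x, a) + \gamma \sum_{y \in X} \hat{P}^{FOIM}_t(y \mid x, a) V^\times(y) \Bigr] \le \Theta(\epsilon (1 - \gamma))$,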



or equivalently, 



^(P(y|x, a )-F t TO ™( y |x,a))y x (y) 



2—1 

Every term in the right-hand side is larger than — e(l — "f)/m, provided that we can 
prove the slightly stronger inequality 

First note that if the second term dominates the max expression, then the inequality 
is automatically true, so we only have to deal with the situation when the first 
term dominates. In this case, the inequality takes the form — -|= ■ 2Vq + -j^—Ve > 



*(l-7) 



, which always holds because of our choice of $R_E$.

We show by induction that $V^{(t)}(x) \ge V^\times(x) - \Theta(\epsilon)$ and $Q^{(t)}(x, a) \ge Q^\times(x, a) - \Theta(\epsilon)$
for all $t = 0, 1, 2, \ldots$ and all $(x, a) \in X \times A$. The inequalities hold for $t = 0$.
When moving from step $t$ to $t + 1$,

$Q^{(t+1)}(x, a) = R(x, a) + \gamma \sum_{y \in X} \hat{P}_t(y \mid x, a) V^{(t)}(y)$
$\ge R(x, a) + \gamma \sum_{y \in X} \hat{P}_t(y \mid x, a) \bigl( V^\times(y) - \Theta(\epsilon) \bigr)$
$\ge Q^\times(x, a) - \gamma \Theta(\epsilon) - \Theta((1 - \gamma) \epsilon)$

for all $(x, a)$, where we applied the induction assumption and eq. (11). Consequently,
$\max_{a \in A} Q^{(t+1)}(x, a) \ge \max_{a \in A} Q^\times(x, a) - \Theta(\epsilon)$ for all $x$. Note that according
to our assumptions, all entries of $H$ are nonnegative as well as the entries
of $G = \mathcal{N}(H^T)$, so multiplication by rows of $HG$ is a monotone operator; furthermore,
all rows sum to 1, yielding

$\sum_{x \in X} [HG]_{y,x} \max_{a} Q^{(t+1)}(x, a) \ge \sum_{x \in X} [HG]_{y,x} \bigl( \max_{a} Q^\times(x, a) - \Theta(\epsilon) \bigr)$,

that is, $V^{(t+1)}(x) \ge V^\times(x) - \Theta(\epsilon)$.

5.2.5. Proximity of value functions. The rest of the proof is standard, so we give 
here a very rough sketch only. We define a cutoff horizon H := 0(j^ log ) 
and an escape event A which happens at timestep t if the agent encounters an 
unknown transition in the next H steps. We will separate two cases depending 
on whether Pr(A) is smaller than £( '] ? ~ 7 ' > or not. If the probability of escape is 
low, then we can show that $Q^{FOIM}(x_t, a_t) > Q^\times(x_t, a_t) - \Theta(\epsilon)$. Otherwise, if
$\Pr(A)$ is large, then an unknown state-action component is found with significant
probability. However, this can happen at most $m N_f |A| k_0$ times (because all
components become known after $k_0$ visits), which is polynomial, so the second case
can happen only a polynomial number of times.

Finally, we remind the reader that the statements are true only with probability $1 - \Theta(\delta)$.
To round off the proof, we note that we are free to choose the constant in the
definition of $R_E$ (as it is hidden in the $\Theta(\cdot)$ notation), so we set it in a way that
$\Theta(\epsilon)$ and $\Theta(\delta)$ become at most $\epsilon$ and $\delta$, respectively.

6. Discussion 

FOIM is conceptually very simple: the exploration-exploitation dilemma is resolved
without any explicit exploration, and action selection is always greedy. The model
update and model solution are also at least as simple as the alternatives found in
the literature. Further, FOIM has some favorable theoretical properties. FOIM is
the first example of an RL algorithm that has a polynomial per-step computational
complexity in FMDPs. To achieve this, we had to relax the near-optimality of the
FMDP planner. The particular planner we used, FVI, runs in polynomial time and
reaches a bounded error, where the looseness of the bound depends on the quality
of basis functions. In almost all time steps, FOIM gets $\epsilon$-close to the FVI value
function with high probability (for any pre-specified $\epsilon$). The number of timesteps
when this does not happen is polynomial.^3

From a practical point of view, calling an FMDP model-solver in each iteration 
could be prohibitive. However, the model and the value function usually change 
very little after a single model update, so we may initialize FVI with the previous 
value function, and a few iterations might be sufficient. 
Appendix A. The Proof of Theorem 5.1
A.1. General lemmas.

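Lemma A.1. (Azuma's Inequality) If the random variables $X_1, X_2, \ldots$ form a martingale
difference sequence, meaning that $E[X_k \mid X_1, X_2, \ldots, X_{k-1}] = 0$ for all $k$, and $|X_k| \le b$ for
each $k$, then

$\Pr\Bigl[ \sum_{i=1}^k X_i > a \Bigr] \le \exp\Bigl( -\frac{a^2}{2 b^2 k} \Bigr)$

and

$\Pr\Bigl[ \Bigl| \sum_{i=1}^k X_i \Bigr| > a \Bigr] \le 2 \exp\Bigl( -\frac{a^2}{2 b^2 k} \Bigr)$.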



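Lemma A.2 (Theorem 3 of [?]). Fix a probability factor $i$ and a pair $(x[\Gamma_i], a) \in X[\Gamma_i] \times A$.
Let $\hat{P}_i(\cdot \mid x[\Gamma_i], a)$ be the empirical distribution of $P_i(\cdot \mid x[\Gamma_i], a)$ after $k_i$ visits to $(x[\Gamma_i], a)$.
Then for all $\epsilon_1 > 0$, the $L_1$-error of the approximation will be small with high probability:

$\Pr\Bigl( \sum_{y \in X_i} |P_i(y \mid x[\Gamma_i], a) - \hat{P}_i(y \mid x[\Gamma_i], a)| > \epsilon_1 \Bigr) \le 2^n \exp\Bigl( -\frac{k_i \epsilon_1^2}{2} \Bigr)$.

Corollary A.3. For any $\delta_1 > 0$, define

(12) $\beta(\delta_1) := \sqrt{2 \Bigl( \log \frac{1}{\delta_1} + n \log 2 \Bigr)}$.

Then with probability at least $1 - \delta_1$,

$\sum_{y \in X_i} |P_i(y \mid x[\Gamma_i], a) - \hat{P}_i(y \mid x[\Gamma_i], a)| \le \frac{\beta(\delta_1)}{\sqrt{k_i}}$.

Lemma A.4. Let $\delta_2 > 0$, $\epsilon_2 > 0$. Let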



8' ■- C'5 2 el/log ^ — 



with some suitable constant C' . Then 



Pr ^(y\*Fr],a) - PtAy\x[r t ],o.)\ >max( 



0(8') 



e 2 ) for any t = 1,2, 



<S, 



■^Note that in general there may be some hard-to-reach states that are visited after a very long 
time only, so not all steps will be near-optimal after a polynomial number of steps. This issue was 
analyzed by [?], who defined an analogue of "probably approximately correctness" for MDPs. 
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that is, the probability is very low that the approximate transition probabilities ever get 
very far from their exact values. 

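Proof. By the union bound, the above probability is at most
\[ \sum_{k_i=1}^{\infty} \Pr\Big(\sum_{y\in X_i} |P_i(y|x[\Gamma_i],a) - \hat P_{k_i,i}(y|x[\Gamma_i],a)| > \max\Big(\frac{\beta(\delta')}{\sqrt{k_i}}, \epsilon_2\Big)\Big). \tag{13} \]

We will cut the sum into two parts; the cutting point $k_0$ is a constant to be determined
later. Define the auxiliary constants
\[ e := \frac{1}{1-\exp(-\epsilon_2^2/2)}, \qquad \delta' := \frac{\delta_2}{k_0 + e} \]
(here $e$ denotes this auxiliary constant, not Euler's number). Note that $\delta'$ is smaller than $\delta_2$.

Let $k_0$ be such that $\beta(\delta')/\sqrt{k_i}$ becomes smaller than $\epsilon_2$ after $k_0$ terms, that is, $\beta(\delta')/\sqrt{k_0} \le \epsilon_2$, or
equivalently,
\[ k_0 \ge \frac{\beta(\delta')^2}{\epsilon_2^2} = \frac{2}{\epsilon_2^2}\Big(\log\frac{1}{\delta'} + n\log 2\Big) = \frac{1}{\epsilon_2^2}\Big(2\log(k_0+e) + 2\log\frac{1}{\delta_2} + 2n\log 2\Big). \]

Using the very loose inequality $\log x \le cx - 1 - \log c$ with $c = \epsilon_2^2/4$, we get that the above
inequality holds if the stronger inequality
\[ k_0 \ge \frac{1}{2}(k_0 + e) + \frac{2}{\epsilon_2^2}\Big(\log\frac{4}{\epsilon_2^2} - 1\Big) + \frac{1}{\epsilon_2^2}\Big(2\log\frac{1}{\delta_2} + 2n\log 2\Big) \]
holds, that is, for $k_0 \ge \Omega\big(\frac{1}{\epsilon_2^2}\log\frac{1}{\delta_2\epsilon_2}\big)$. While this is a lower bound on $k_0$, this also means
that there is a constant $C$ such that
\[ k_0 = C\,\frac{1}{\epsilon_2^2}\log\frac{1}{\delta_2\epsilon_2} \tag{14} \]
satisfies the inequality.

Using the above facts and Corollary A.3, the sum of terms up to $k_0$ is bounded by
\[ \sum_{k_i=1}^{k_0-1} \Pr\Big(\sum_{y\in X_i} |P_i(y|x[\Gamma_i],a) - \hat P_{k_i,i}(y|x[\Gamma_i],a)| > \frac{\beta(\delta')}{\sqrt{k_i}}\Big) \le \sum_{k_i=1}^{k_0-1} \delta' = (k_0 - 1)\delta' < \frac{k_0}{k_0+e}\delta_2. \]

For the second part, note that the error probability for $k_0$ visits is at most
$2^n\exp(-k_0\epsilon_2^2/2) \le \delta'$ by the definition of $k_0$. Therefore, by Lemma A.2, the sum of
terms above $k_0$ is at most
\[ \sum_{k_i=k_0}^{\infty} 2^n\exp\Big(-\frac{k_i\epsilon_2^2}{2}\Big) = 2^n\exp\Big(-\frac{k_0\epsilon_2^2}{2}\Big)\sum_{k_i=k_0}^{\infty}\exp\Big(-\frac{(k_i-k_0)\epsilon_2^2}{2}\Big) \le \delta'\,\frac{1}{1-\exp(-\epsilon_2^2/2)} = \delta' e = \frac{e}{k_0+e}\delta_2. \]

Consequently, the full sum is at most $\frac{k_0}{k_0+e}\delta_2 + \frac{e}{k_0+e}\delta_2 = \delta_2$.
To complete the proof, note that $e = O(1/\epsilon_2^2)$ (which follows easily from the fact that
$1 + x \le \exp(x) \le 1 + 2x$ for $x \in [0,1]$). Furthermore, recall that $k_0 = \Theta\big(\frac{1}{\epsilon_2^2}\log\frac{1}{\delta_2\epsilon_2}\big)$, so
$\delta' = \delta_2 / \Theta\big(\frac{1}{\epsilon_2^2}\log\frac{1}{\delta_2\epsilon_2}\big)$, that is, $\delta' = \Theta\big(\delta_2\epsilon_2^2 / \log\frac{1}{\delta_2\epsilon_2}\big)$, as required. ■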
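
The quantities of Lemma A.2, Corollary A.3 and Lemma A.4 are easy to tabulate. The sketch below is only an illustration: it assumes the reading $\beta(\delta) = \sqrt{2(\log(1/\delta) + n\log 2)}$ of eq. (12), and the function names are hypothetical. It shows that plugging the radius $\beta(\delta)/\sqrt{k}$ into the tail bound of Lemma A.2 gives back exactly $\delta$, and that the threshold of Lemma A.4 stops shrinking once it reaches $\epsilon_2$.

```python
import math


def beta(delta, n):
    """beta(delta) as read from eq. (12): sqrt(2 * (log(1/delta) + n * log 2))."""
    return math.sqrt(2.0 * (math.log(1.0 / delta) + n * math.log(2.0)))


def l1_radius(k, delta, n):
    """Radius beta(delta) / sqrt(k) of Corollary A.3 after k visits."""
    return beta(delta, n) / math.sqrt(k)


def capped_radius(k, delta_prime, eps2, n):
    """Threshold max(beta(delta') / sqrt(k), eps2) used in Lemma A.4."""
    return max(l1_radius(k, delta_prime, n), eps2)


def tail_bound(k, eps, n):
    """Right-hand side of Lemma A.2: 2^n * exp(-k * eps^2 / 2)."""
    return 2.0 ** n * math.exp(-k * eps * eps / 2.0)


# Example values (hypothetical): n = 3, delta = 0.05.
for k in (10, 100, 1000, 10 ** 6):
    print(k, l1_radius(k, 0.05, 3), capped_radius(k, 0.05, 0.1, 3))
```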
The following lemma is almost identical to Corollary 1 of [?]; the only change is that
we allow different $\epsilon$'s and $\delta$'s for different components. The original proof of [?] carries
through with this modification unchanged, so it is omitted here.

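Lemma A.5. Fix a pair $(x,a) \in X \times A$. Suppose that all probability factors are
approximated well in $L_1$-norm, i.e., for all $i$, there exist $\epsilon_{3,i} > 0$, $\delta_{3,i} > 0$ such that
\[ \sum_{y_i\in X_i} |P_i(y_i|x[\Gamma_i],a) - \hat P_i(y_i|x[\Gamma_i],a)| \le \epsilon_{3,i} \]
with probability at least $1 - \delta_{3,i}$. Then
\[ \sum_{y\in X} |P(y|x,a) - \hat P(y|x,a)| \le \sum_{i=1}^{m} \epsilon_{3,i} \]
with probability at least $1 - \sum_{i=1}^{m}\delta_{3,i}$.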

The previous lemma bounds the error for a single state. The following lemma extends
the result, showing that the probability of a large approximation error anywhere in the
state space is low.

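Lemma A.6. Let $\delta_4 > 0$, $\epsilon_4 > 0$. Let
\[ \delta' := C'\,\frac{\delta_4\epsilon_4^2}{m^3 N_f |A|} \Big/ \log\frac{m^2 N_f |A|}{\delta_4\epsilon_4} \]
with some suitably defined constant $C'$. Then with probability at least $1 - \delta_4$,
\[ \sum_{y\in X} |\hat P_t(y|x,a) - P(y|x,a)| \le \sum_{i=1}^{m}\max\Big(\frac{\beta(\delta')}{\sqrt{k_t(x[\Gamma_i],a)}}, \frac{\epsilon_4}{m}\Big) \]
for any $t = 1,2,\ldots$ and any $(x,a) \in X \times A$.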



Proof. Fix a component $(x[\Gamma_i],a) \in X[\Gamma_i] \times A$. By applying Lemma A.4 to this
component with $\epsilon_2 = \epsilon_4/m$ and $\delta_2 = \delta_4/(mN_f|A|)$, we get that
\[ \sum_{y\in X_i} |P_i(y|x[\Gamma_i],a) - \hat P_{t,i}(y|x[\Gamma_i],a)| \le \max\Big(\frac{\beta(\delta')}{\sqrt{k_t(x[\Gamma_i],a)}}, \frac{\epsilon_4}{m}\Big) \tag{15} \]
for all $t$ with probability at least $1 - \delta_4/(mN_f|A|)$. There are $mN_f|A|$ different components,
so the probability that (15) is ever violated for any of them is still less than $\delta_4$. If no
components violate (15), then we can apply Lemma A.5 to all $(x,a) \in X \times A$ and all
timesteps $t = 1,2,\ldots$, proving the statement of the lemma. ■

Definition A.7 (known state-action components). For any $\epsilon_4 > 0$, $\delta_4 > 0$, let
\[ KB(\epsilon_4,\delta_4) := C\,\frac{m^2}{\epsilon_4^2}\log\frac{m^2 N_f |A|}{\delta_4\epsilon_4}, \]
where $C$ is a suitable constant defined by eq. (14). The pair $(x[\Gamma_i],a) \in X[\Gamma_i] \times A$ is
$(\epsilon_4,\delta_4)$-known, if
\[ k(x[\Gamma_i],a) \ge KB(\epsilon_4,\delta_4). \]

Note that $KB(\epsilon_4,\delta_4)$ is the number of visits from which on the second term
in the maximum expressions dominates the first one. Therefore, after more than
$KB(\epsilon_4,\delta_4)$ visits to a component, we can really feel confident that it is known: if all
components were known, then all the approximate transition probabilities of the FMDP
would be within $\epsilon_4$ $L_1$-distance from the true values, with less than $\delta_4$ total probability of
an error.
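
As a small illustration of Definition A.7, the sketch below computes the visit threshold and the set of known components from a table of visit counts. It assumes the reading $KB(\epsilon_4,\delta_4) = C\,(m^2/\epsilon_4^2)\log(m^2N_f|A|/(\delta_4\epsilon_4))$; the constant $C$ is not specified numerically in the text, so the default `C=1.0` is only a placeholder, and the function names and the key layout of `visit_counts` are hypothetical.

```python
import math


def known_bound(eps4, delta4, m, n_f, n_actions, C=1.0):
    """KB(eps4, delta4) = C * m^2 / eps4^2 * log(m^2 * N_f * |A| / (delta4 * eps4)).

    C stands for the unspecified constant of the definition (placeholder value).
    """
    return C * m ** 2 / eps4 ** 2 * math.log(m ** 2 * n_f * n_actions / (delta4 * eps4))


def known_components(visit_counts, eps4, delta4, m, n_f, n_actions, C=1.0):
    """Return the components whose visit count has reached KB."""
    kb = known_bound(eps4, delta4, m, n_f, n_actions, C)
    return {key for key, k in visit_counts.items() if k >= kb}


# Hypothetical visit counts, keyed by (factor index, parent values, action).
counts = {(0, (0,), 0): 200, (1, (1,), 1): 10}
print(known_bound(0.5, 0.1, m=2, n_f=4, n_actions=2))
print(known_components(counts, 0.5, 0.1, m=2, n_f=4, n_actions=2))
```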



A.2. Some bounds for value functions. Let us assume that all rewards fall between
$0$ and $R_{\max}$. In that case, the maximum possible value of a state is $V_{\max} := R_{\max}/(1-\gamma)$.



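Lemma A.8. Consider an FMDP with transition functions $\{P^a\}$, and let $\pi$ be an
arbitrary policy. Let $v^\pi$ be the fixed point of the iteration
\[ v^\pi = HG\sum_{a\in A}\pi(\cdot,a)\,(r^a + \gamma P^a v^\pi). \]
There exists a universal bound
\[ V_0 := \frac{3-\gamma}{1-\gamma}V_{\max} = O\Big(\frac{R_{\max}}{(1-\gamma)^2}\Big) \]
for which $\|v^\pi\|_\infty \le V_0$.

Proof. Let $\bar v^\pi$ be the solution of the exact Bellman equations: $\bar v^\pi = \sum_{a\in A}\pi(\cdot,a)\,(r^a + \gamma P^a \bar v^\pi)$. Then
\[
\begin{aligned}
\|v^\pi - \bar v^\pi\|_\infty
&= \Big\|HG\sum_{a\in A}\pi(\cdot,a)(r^a + \gamma P^a v^\pi) - \sum_{a\in A}\pi(\cdot,a)(r^a + \gamma P^a \bar v^\pi)\Big\|_\infty \\
&\le \Big\|HG\sum_{a\in A}\pi(\cdot,a)(r^a + \gamma P^a v^\pi) - HG\sum_{a\in A}\pi(\cdot,a)(r^a + \gamma P^a \bar v^\pi)\Big\|_\infty \\
&\quad + \Big\|HG\sum_{a\in A}\pi(\cdot,a)(r^a + \gamma P^a \bar v^\pi) - \sum_{a\in A}\pi(\cdot,a)(r^a + \gamma P^a \bar v^\pi)\Big\|_\infty \\
&\le \gamma\|HG\|_\infty\|v^\pi - \bar v^\pi\|_\infty + \|(HG)\bar v^\pi - \bar v^\pi\|_\infty \\
&\le \gamma\|v^\pi - \bar v^\pi\|_\infty + \|HG\|_\infty\|\bar v^\pi\|_\infty + \|\bar v^\pi\|_\infty,
\end{aligned}
\]
so $\|v^\pi - \bar v^\pi\|_\infty \le \frac{2}{1-\gamma}\|\bar v^\pi\|_\infty$. Therefore,
\[ \|v^\pi\|_\infty \le \|v^\pi - \bar v^\pi\|_\infty + \|\bar v^\pi\|_\infty \le \Big(\frac{2}{1-\gamma} + 1\Big)V_{\max} = \frac{3-\gamma}{1-\gamma}V_{\max} = V_0. \]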



Lemma A.9. The AVI-optimal value function $v^\times$ is also bounded by $V_0$:
\[ \|v^\times\|_\infty \le V_0. \]

Proof. For any $G$ satisfying $\|HG\|_\infty \le 1$,
\[
\begin{aligned}
\|v^\times\|_\infty
&\le \|v^\times - v^*\|_\infty + \|v^*\|_\infty \le \frac{1}{1-\gamma}\|HGv^* - v^*\|_\infty + \|v^*\|_\infty \\
&\le \frac{1}{1-\gamma}\|HG\|_\infty\|v^*\|_\infty + \frac{1}{1-\gamma}\|v^*\|_\infty + \|v^*\|_\infty \\
&\le \Big(\frac{2}{1-\gamma} + 1\Big)\|v^*\|_\infty \le \Big(\frac{2}{1-\gamma} + 1\Big)V_{\max} = V_0,
\end{aligned}
\]
where we used the triangle inequality and eq. (5) of the main paper in the first line.

A.3. The known-state FMDP.



Lemma A.10 (Simulation lemma with function approximation). Let $\epsilon_5 > 0$. Consider
two FMDPs $M$ and $\hat M$ with joint transition probabilities $P$ and $\hat P$, otherwise identical.
For any policy $\pi$, consider the corresponding value functions $v^\pi$ and $\hat v^\pi$. If
\[ \sum_{y\in X} |P(y|x,a) - \hat P(y|x,a)| \le \frac{\epsilon_5(1-\gamma)}{\gamma V_0} \]
for all $(x,a) \in X \times A$, then
\[ \|v^\pi - \hat v^\pi\|_\infty \le \epsilon_5. \]

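Proof. Let $\Delta := \|v^\pi - \hat v^\pi\|_\infty$, and let $(x_\Delta, a_\Delta)$ be a state-action pair for which
$\max_{a\in A}\|P^a v^\pi - \hat P^a \hat v^\pi\|_\infty$ takes its maximum, i.e.,
\[ \max_{a\in A}\|P^a v^\pi - \hat P^a \hat v^\pi\|_\infty = \Big|\sum_{y\in X}\big(P(y|x_\Delta,a_\Delta)v^\pi(y) - \hat P(y|x_\Delta,a_\Delta)\hat v^\pi(y)\big)\Big|. \]
Using this,
\[
\begin{aligned}
\Delta &= \Big\|HG\sum_{a\in A}\pi(\cdot,a)(r^a + \gamma P^a v^\pi) - HG\sum_{a\in A}\pi(\cdot,a)(r^a + \gamma \hat P^a \hat v^\pi)\Big\|_\infty \\
&\le \|HG\|_\infty\,\gamma\max_{a\in A}\|P^a v^\pi - \hat P^a \hat v^\pi\|_\infty \\
&\le \gamma\max_{a\in A}\|P^a v^\pi - \hat P^a \hat v^\pi\|_\infty \\
&= \gamma\Big|\sum_{y\in X}\big(P(y|x_\Delta,a_\Delta)v^\pi(y) - \hat P(y|x_\Delta,a_\Delta)\hat v^\pi(y)\big)\Big| \\
&\le \gamma\Big|\sum_{y\in X}\big[P(y|x_\Delta,a_\Delta) - \hat P(y|x_\Delta,a_\Delta)\big]v^\pi(y)\Big| + \gamma\Big|\sum_{y\in X}\hat P(y|x_\Delta,a_\Delta)\big[v^\pi(y) - \hat v^\pi(y)\big]\Big| \\
&\le \gamma\sum_{y\in X}|P(y|x_\Delta,a_\Delta) - \hat P(y|x_\Delta,a_\Delta)|\,V_0 + \gamma\sum_{y\in X}\hat P(y|x_\Delta,a_\Delta)\,\|v^\pi - \hat v^\pi\|_\infty \\
&\le \frac{\epsilon_5(1-\gamma)}{\gamma V_0}\,\gamma V_0 + \gamma\Delta = (1-\gamma)\epsilon_5 + \gamma\Delta.
\end{aligned}
\]
Rearranging gives $\Delta \le \epsilon_5$. ■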
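
The simulation lemma can be checked numerically on a toy example. The sketch below is an illustration under simplifying assumptions, not the factored setting of the lemma: the policy is already folded into a single transition matrix, $HG$ is taken to be the identity (so $R_{\max}/(1-\gamma)$ is a valid bound in place of $V_0$), and the three-state chain and helper names are hypothetical. Each row of the model is perturbed by exactly the $L_1$ budget $\epsilon_5(1-\gamma)/(\gamma V_0)$, and the resulting value gap is compared with $\epsilon_5$.

```python
def evaluate(P, r, gamma, sweeps=2000):
    """Iterate v <- r + gamma * P v for a fixed policy (HG taken as identity)."""
    n = len(r)
    v = [0.0] * n
    for _ in range(sweeps):
        v = [r[x] + gamma * sum(P[x][y] * v[y] for y in range(n)) for x in range(n)]
    return v


def perturb_row(row, shift):
    """Move `shift` probability mass from the first entry to the last one,
    which changes the row by 2 * shift in L1-norm."""
    new_row = list(row)
    new_row[0] -= shift
    new_row[-1] += shift
    return new_row


gamma, r_max, eps5 = 0.9, 1.0, 0.1
v0_bound = r_max / (1.0 - gamma)  # valid value bound when HG is the identity
l1_budget = eps5 * (1.0 - gamma) / (gamma * v0_bound)

# Hypothetical three-state chain with rewards in [0, r_max].
P = [[0.7, 0.2, 0.1], [0.3, 0.4, 0.3], [0.5, 0.25, 0.25]]
r = [0.0, 0.5, 1.0]
P_hat = [perturb_row(row, l1_budget / 2.0) for row in P]

v = evaluate(P, r, gamma)
v_hat = evaluate(P_hat, r, gamma)
gap = max(abs(a - b) for a, b in zip(v, v_hat))
print("L1 budget per row:", l1_budget)
print("value gap:", gap, "(bound:", eps5, ")")
```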



Definition A.11 (known-state FMDP). Let $\epsilon > 0$, $\delta > 0$ be arbitrary. Consider
an FMDP $M$ and a series of FMDPs $\hat M_t$ with joint transition probabilities $P$ and $\hat P_t$,
otherwise identical. Furthermore, let $K_t$ be the set of $(\epsilon,\delta)$-known $(x[\Gamma_i],a)$ pairs. Define
the FMDP $M^{K_t}$ so that

• its state and action space, rewards, and the decompositions of the transition
probabilities (i.e., the dependency sets $\Gamma_i$) are identical to $M$ and $\hat M_t$,

• for all $a \in A$, $i \in \{1,\ldots,m\}$ and $x[\Gamma_i] \in X[\Gamma_i]$, for any $y_i \in X_i$, the corresponding
transition probability component is
\[ P^{K_t}_i(y_i|x[\Gamma_i],a) := \begin{cases} \hat P_{t,i}(y_i|x[\Gamma_i],a), & \text{if } (x[\Gamma_i],a) \in K_t \text{ (known pairs)}; \\ P_i(y_i|x[\Gamma_i],a), & \text{if } (x[\Gamma_i],a) \notin K_t \text{ (unknown pairs)}. \end{cases} \]



Lemma A.12. Let $\epsilon_6 > 0$, $\delta_6 > 0$. Suppose that FOIM is executed on an FMDP $M =
(X, A, P, R, \gamma, \{\Gamma_i\}, \{Z_j\})$. If the initial value of the garden of Eden state satisfies
\[ R_E \ge c\cdot\frac{m R_{\max}^2}{(1-\gamma)^3\epsilon_6}\log\frac{mN_f|A|}{\epsilon_6\delta_6} \tag{16} \]
with some constant $c$, then with probability at least $1 - \delta_6$,
\[ Q^\times(x,a) - \Big[R(x,a) + \gamma\sum_{y\in X}\hat P^{FOIM}_t(y|x,a)V^\times(y)\Big] \le \epsilon_6 \tag{17} \]
for all $(x,a) \in X \times A$ and all $t = 1,2,\ldots$.

Proof. By Lemma A.6 (with setting $\epsilon_4 = \epsilon_6$ and $\delta_4 = \delta_6$), for any $t = 1,2,\ldots$ and any
$(x,a) \in X \times A$,
\[ \sum_{y\in X} |P(y|x,a) - \hat P_t(y|x,a)| \le \sum_{i=1}^{m}\max\Big(\frac{\beta(\delta')}{\sqrt{k_t(x[\Gamma_i],a)}}, \frac{\epsilon_6}{m}\Big) \]
with probability at least $1 - \delta_6$, where
\[ \delta' = C'\,\frac{\delta_6\epsilon_6^2}{m^3N_f|A|} \Big/ \log\frac{m^2N_f|A|}{\delta_6\epsilon_6}. \tag{18} \]

Fix a state-action pair $(x,a)$, and let us use the shorthand $k_{t,i} = k_t(x[\Gamma_i],a)$ and $\beta =
\beta(\delta')$. Define the transition probabilities $\hat P^{FOIM}_t$ as the empirical approximate probabilities
with the hypothetical visit to the garden of Eden state. Note that
\[ \hat P^{FOIM}_{t,i}(y_i|x[\Gamma_i],a) = \begin{cases} \frac{k_{t,i}}{k_{t,i}+1}\hat P_{t,i}(y_i|x[\Gamma_i],a), & \text{if } y_i \ne x_E; \\ \frac{1}{k_{t,i}+1}, & \text{otherwise,} \end{cases} \]



£(P(y I x,o)- Pf 0/M (y I x,a))l/ x (y) > -£inax(-|L, — ) • %±±V + t^. 

We will prove that every term in the right-hand side is larger than $-\epsilon_6/m$. We are going
to prove the slightly stronger inequality
\[ -\max\Big(\frac{\beta}{\sqrt{k_{t,i}}}, \frac{\epsilon_6}{m}\Big)\cdot 2V_0 + \frac{1}{k_{t,i}}V_E \ge -\frac{\epsilon_6}{m}. \]

First of all, note that if the second term dominates the max expression, then the inequality
is automatically true, so we only have to deal with the situation when the first term
dominates. In this case, the inequality to prove becomes
\[ -\frac{\beta}{\sqrt{k_{t,i}}}\,2V_0 + \frac{1}{k_{t,i}}V_E \ge -\frac{\epsilon_6}{m}. \]

After multiplication by $k_{t,i}$ and taking the derivative, we get that the left-hand
side takes its minimum where
\[ -\frac{\beta V_0}{\sqrt{k_{t,i}}} + \frac{\epsilon_6}{m} = 0, \]
that is, for
\[ k_{t,i} = \Big(\frac{m\beta V_0}{\epsilon_6}\Big)^2. \]

Substituting the minimum place to the inequality, we get that it always holds if
\[ -\frac{\beta\epsilon_6}{m\beta V_0}\,2V_0 + \frac{\epsilon_6^2}{(m\beta V_0)^2}V_E \ge -\frac{\epsilon_6}{m}, \]
that is, if
\[
\begin{aligned}
V_E &\ge \frac{m}{\epsilon_6}(V_0)^2\beta^2 \\
&= \frac{m}{\epsilon_6}(V_0)^2\cdot 2\Big(\log\frac{m^3N_f|A|\log\frac{m^2N_f|A|}{\delta_6\epsilon_6}}{C'\delta_6\epsilon_6^2} + n\log 2\Big) \\
&= \Theta\Big(\frac{mR_{\max}^2}{(1-\gamma)^4\epsilon_6}\log\frac{m^3N_f|A|\log\frac{m^2N_f|A|}{\delta_6\epsilon_6}}{\delta_6\epsilon_6^2}\Big) \\
&= \Theta\Big(\frac{mR_{\max}^2}{(1-\gamma)^4\epsilon_6}\log\frac{mN_f|A|}{\epsilon_6\delta_6}\Big),
\end{aligned}
\tag{19}
\]
which holds by the assumption of the lemma. During the transformations, we used the
definition of $\beta = \beta(\delta')$ in eq. (12), the definition of $\delta'$ in eq. (18), the fact that $V_0 =
\Theta\big(\frac{R_{\max}}{(1-\gamma)^2}\big)$, and that $n$ is a small constant hidden by the $O(\cdot)$ notation. By noting that
$V_E = R_E/(1-\gamma)$, the proof of the lemma is complete.

■
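
The "hypothetical visit to the garden of Eden state" used in the proof of Lemma A.12 amounts to one pseudo-count. The following minimal sketch assumes the reading that each observed value receives probability $\frac{k}{k+1}\hat P$ and the Eden value receives $\frac{1}{k+1}$, where $k$ is the number of real visits; the function name `foim_component_probs` and the label `"x_E"` are hypothetical.

```python
def foim_component_probs(counts, eden="x_E"):
    """Empirical distribution with one extra hypothetical visit to `eden`.

    counts maps each observed next value y_i to its visit count. With k the
    total count, every observed value gets (k / (k + 1)) * (count / k), which
    equals count / (k + 1), and the garden of Eden value gets 1 / (k + 1).
    """
    k = sum(counts.values())
    probs = {y: c / (k + 1.0) for y, c in counts.items()}
    probs[eden] = 1.0 / (k + 1.0)
    return probs


# Before any real visit the component leads to Eden with probability 1.
print(foim_component_probs({}))
# After four real visits the optimistic mass has shrunk to 1/5.
print(foim_component_probs({"0": 3, "1": 1}))
```

With no data the model is fully optimistic, and the optimistic mass decays as $1/(k+1)$, which is the $\frac{1}{k_{t,i}}V_E$-type bonus that the proof balances against the $\beta/\sqrt{k_{t,i}}$ estimation error.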

The following result shows that FOIM preserves the optimism of the value function 
with high probability. 

Lemma A.13. Let $\epsilon_7 > 0$, $\delta_7 > 0$. Suppose that FOIM is used with $\epsilon_6 = (1-\gamma)\epsilon_7$, $\delta_6 = \delta_7$
and $R_E$ satisfying (16). Then, with probability at least $1 - \delta_7$, $V^{(t)}(x) \ge V^\times(x) - \epsilon_7$ and
$Q^{(t)}(x,a) \ge Q^\times(x,a) - \epsilon_7$ for all $t = 0,1,2,\ldots$ and all $(x,a) \in X \times A$.

Proof. According to Lemma A.12, eq. (17) holds for all $(x,a) \in X \times A$ and all $t$ with high
probability. So, except for an error event (with probability at most $\delta_7$), we can proceed
with the following induction on the number of DP-updates. Initially, $V^{(0)}(x) \ge V^\times(x) - \epsilon_7$.
When moving from step $t$ to $t+1$,
\[
\begin{aligned}
Q^{(t+1)}(x,a) &= R(x,a) + \gamma\sum_{y\in X}\hat P_t(y|x,a)V^{(t)}(y) \\
&\ge R(x,a) + \gamma\sum_{y\in X}\hat P_t(y|x,a)\big(V^\times(y) - \epsilon_7\big) \\
&\ge Q^\times(x,a) - \gamma\epsilon_7 - (1-\gamma)\epsilon_7
\end{aligned}
\]
for all $(x,a)$, where we applied the induction assumption and Lemma A.12. Consequently,
\[ \max_{a} Q^{(t+1)}(x,a) \ge \max_{a} Q^\times(x,a) - \epsilon_7 \]
for all $x$. Note that according to our assumptions, all entries of $H$ are nonnegative as well
as the entries of $G = \mathcal{N}(H^T)$, so multiplication by rows of $HG$ is a monotonous operator;
furthermore, all rows sum to 1, yielding
\[ \sum_{x\in X}[HG]_{y,x}\max_{a}Q^{(t+1)}(x,a) \ge \sum_{x\in X}[HG]_{y,x}\big(\max_{a}Q^\times(x,a) - \epsilon_7\big), \]
that is, $V^{(t+1)}(x) \ge V^\times(x) - \epsilon_7$ with prob. $1 - \delta_7$. ■

A.4. Proximity of value functions. In the following, we will show that whenever the 
algorithm remains in the known region of the FMDP, its value function is very close to 
the AVI-optimal Q x . The two value functions will be related to each other through a 
sequence of other value functions. 

Let us fix $\epsilon > 0$, $\delta > 0$, and let $\epsilon_8 := \epsilon/4$, $\delta_8 := \delta/2$. Let
\[ H := \log\frac{R_E}{\epsilon_8(1-\gamma)} \Big/ \log\frac{1}{\gamma} \]
be the $\epsilon_8$-horizon time. For a given point during the execution of the algorithm, let $A$
denote the event that the algorithm will encounter an unknown transition in the next $H$
steps. We will separate two cases depending on whether $\Pr(A)$ is small or large. Firstly,
assume that
\[ \Pr(A) \le \frac{\epsilon_8(1-\gamma)}{R_E}. \]

Let $M$ denote the true (and unknown) FMDP, and fix a pair $(x_1,a_1) \in X \times A$. Let
$Q^{FOIM}_M(x_1,a_1)$ be the expected reward collected by FOIM in $M$, and let $Q^{FOIM}_M(x_1,a_1,H)$
be the $H$-step truncated version.

Statement 1. $Q^{FOIM}_M(x_1,a_1) \ge Q^{FOIM}_M(x_1,a_1,H)$.

Proof. Because of our assumption that all rewards are nonnegative, truncation removes
only nonnegative terms. ■

Let $M^{K_t}$ be the known-state FMDP defined by the $(\epsilon_8,\delta_8)$-known components.

Statement 2. $Q^{FOIM}_M(x_1,a_1,H) \ge Q^{FOIM}_{M^{K_t}}(x_1,a_1,H) - \epsilon_8$.

Proof. On known states, $M$ and $M^{K_t}$ are identical (by the definition of $M^{K_t}$), so the
collected rewards are identical, too. If the algorithm encounters an unknown state-action
pair, the difference of the two value functions may be as large as $R_E/(1-\gamma)$. However,
the probability that an unknown pair is found in the next $H$ steps is at most $\epsilon_8(1-\gamma)/R_E$ by
assumption, so
\[ Q^{FOIM}_M(x_1,a_1,H) \ge Q^{FOIM}_{M^{K_t}}(x_1,a_1,H) - \Pr(A)\frac{R_E}{1-\gamma} \ge Q^{FOIM}_{M^{K_t}}(x_1,a_1,H) - \epsilon_8. \]
■

Statement 3. $Q^{FOIM}_{M^{K_t}}(x_1,a_1,H) \ge Q^{FOIM}_{M^{K_t}}(x_1,a_1) - \epsilon_8$.

Proof. This is a simple restatement of the fact that $H$ is an $\epsilon_8$-horizon time. ■
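
The role of the horizon time in Statements 1-3 can be made concrete with a small computation. The sketch below assumes that per-step rewards are at most $R_E$ and that the $\epsilon_8$-horizon time is the smallest $H$ with $\gamma^H R_E/(1-\gamma) \le \epsilon_8$; the function names and the numerical values are hypothetical.

```python
import math


def horizon_time(r_e, eps8, gamma):
    """Smallest integer H with gamma^H * r_e / (1 - gamma) <= eps8."""
    h = math.log(r_e / (eps8 * (1.0 - gamma))) / math.log(1.0 / gamma)
    return max(0, math.ceil(h))


def truncation_loss(r_e, gamma, h):
    """Largest possible discounted reward collected after step h."""
    return gamma ** h * r_e / (1.0 - gamma)


H = horizon_time(r_e=10.0, eps8=0.1, gamma=0.9)
print("H =", H)
print("loss at H:", truncation_loss(10.0, 0.9, H))
print("loss at H - 1:", truncation_loss(10.0, 0.9, H - 1))
```

Since the horizon grows only logarithmically in $R_E/\epsilon_8$, a large optimistic reward $R_E$ has a mild effect on $H$.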
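Let $\hat M_t$ be the approximate FMDP built by FOIM, and suppose that
\[ R_E \ge c\cdot\frac{mR_{\max}^2}{(1-\gamma)^4\epsilon_8}\log\frac{mN_f|A|}{(1-\gamma)\epsilon_8\delta_8}. \tag{20} \]

Statement 4. $Q^{FOIM}_{M^{K_t}}(x_1,a_1) \ge Q^{FOIM}_{\hat M_t}(x_1,a_1) - \epsilon_8$.

Proof. Follows from Lemma A.10 by substituting $\epsilon_5 = \epsilon_8$. ■

Statement 5. If $R_E$ satisfies eq. (20), then $Q^{FOIM}_{\hat M_t}(x_1,a_1) \ge Q^\times(x_1,a_1) - \epsilon_8$ with
probability at least $1 - \delta_8$.

Proof. This is basically the statement of Lemma A.13 with the assignment $\epsilon_7 := \epsilon_8$
and $\delta_7 := \delta_8$. ■

Summing up statements 1-5, we get that for $\Pr(A) \le \epsilon_8(1-\gamma)/R_E$,
\[ Q^{FOIM}_M(x_1,a_1) \ge Q^\times(x_1,a_1) - 4\epsilon_8 = Q^\times(x_1,a_1) - \epsilon \tag{21} \]
with probability $1 - \delta_8$.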



A.5. Finding unknown regions. We will now show that whenever
\[ \Pr(A) > \frac{\epsilon_8(1-\gamma)}{R_E}, \tag{22} \]
we will make a significant update to the model with relatively high probability. We will
use the following simple consequence of the Hoeffding inequality [Thomas Walsh, personal
communication]:

Lemma A.14. Suppose a weighted coin, when flipped, has probability $p > 0$ of landing
with heads up. Then, for any positive integer $k$ and real number $\delta \in (0,1)$, there exists a
number $m = O\big(\frac{k}{p}\log\frac{1}{\delta}\big)$, such that after $m$ tosses, with probability at least $1 - \delta$, we will
observe $k$ or more heads.
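
Lemma A.14 can be checked exactly for concrete numbers. The sketch below is an illustration only: the explicit choice $m = \lceil 2(k + \ln(1/\delta))/p \rceil$ comes from a standard multiplicative Chernoff bound and is an assumption of this example, not a constant stated in the lemma; the function names are hypothetical. The failure probability is computed exactly from the binomial distribution.

```python
import math


def tosses_needed(k, p, delta):
    """A sufficient number of tosses, m = ceil(2 * (k + ln(1/delta)) / p).

    The constant is taken from a Chernoff bound and is an assumption of this
    sketch; the lemma itself only states the order of magnitude.
    """
    return math.ceil(2.0 * (k + math.log(1.0 / delta)) / p)


def prob_fewer_than_k_heads(m, k, p):
    """Exact binomial probability of seeing fewer than k heads in m tosses."""
    return sum(math.comb(m, j) * p ** j * (1.0 - p) ** (m - j) for j in range(k))


k, p, delta = 5, 0.1, 0.05
m = tosses_needed(k, p, delta)
print("tosses:", m)
print("failure probability:", prob_fewer_than_k_heads(m, k, p))
```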
Lemma A.15. With probability $1 - \delta$, (22) will occur at most
\[ O\Big(\frac{R_{\max}^2 m^4 N_f|A|}{\epsilon^4(1-\gamma)^4}\log^3\frac{1}{\delta}\log^2\frac{mN_f|A|}{\epsilon}\Big) \]
times.

Proof. Let $N_A$ be the number of timesteps when (22) holds, and for all $n = 1,\ldots,N_A$,
let
\[ a_{t_n} := \begin{cases} 1, & \text{if an unknown pair } (x_{t_n}[\Gamma_i],a) \notin K \text{ was encountered;} \\ 0, & \text{otherwise.} \end{cases} \]

The number of model updates is simply $\sum_{n=1}^{N_A} a_{t_n}$, which is at most $mN_f|A|\cdot KB(\epsilon_8,\delta_8) =
O(mN_f|A|\cdot KB(\epsilon,\delta))$. On the other hand, by Lemma A.14, $\sum_{n} a_{t_n}$ will be at least
$mN_f|A|\cdot KB(\epsilon_8,\delta_8)$ with probability at least $1 - \delta$ after
\[
\Theta\Big(\frac{mN_f|A|\cdot KB(\epsilon/4,\delta)}{\Pr(A)}\log\frac{1}{\delta}\Big)
= \Theta\Big(\frac{R_E}{\epsilon(1-\gamma)}\,mN_f|A|\cdot KB(\epsilon,\delta)\log\frac{1}{\delta}\Big)
= O\Big(\frac{R_{\max}^2 m^4 N_f|A|}{\epsilon^4(1-\gamma)^4}\log^3\frac{1}{\delta}\log^2\frac{mN_f|A|}{\epsilon}\Big)
\]
steps.

Putting the two cases together, we get the following:

Theorem A.16. Let $\epsilon > 0$ and $\delta > 0$, and suppose FOIM is initialized with
\[ R_E = 4c\cdot\frac{mR_{\max}^2}{(1-\gamma)^4\epsilon}\log\frac{8mN_f|A|}{(1-\gamma)\epsilon\delta}. \]
With probability at least $1 - \delta$, the number of timesteps when FOIM makes non-AVI-near-optimal
moves, i.e., when
\[ Q^{FOIM}(x_t,a_t) < Q^\times(x_t,a_t) - \epsilon, \]
is bounded by
\[ O\Big(\frac{R_{\max}^2 m^4 N_f|A|}{\epsilon^4(1-\gamma)^4}\log^3\frac{1}{\delta}\log^2\frac{mN_f|A|}{\epsilon}\Big). \]

Proof. By eq. (21), with probability $1 - \delta/2$, FOIM makes AVI-near-optimal moves
whenever $\Pr(A)$ is small. The number of times this does not happen is bounded by
Lemma A.15 with probability $1 - \delta/2$, and is exactly the bound given by the statement
of the theorem. ■